
    Scalable Breadth-First Search on a GPU Cluster

    On a GPU cluster, the ratio of high computing power to communication bandwidth makes scaling breadth-first search (BFS) on a scale-free graph extremely challenging. By separating high- and low-out-degree vertices, we present an implementation with scalable computation and a model for scalable communication for BFS and direction-optimized BFS. Our communication model uses global reduction for high-degree vertices and point-to-point transmission for low-degree vertices. Leveraging the characteristics of degree separation, we reduce the graph size to one third of the conventional edge-list representation. With several other optimizations, we observe linear weak scaling as we increase the number of GPUs, and achieve 259.8 GTEPS on a scale-33 Graph500 RMAT graph with 124 GPUs on the latest CORAL early access system.
    Comment: 12 pages, 13 figures. To appear at IPDPS 201
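The degree-separated communication model described above can be sketched in a few lines. This is a minimal single-machine illustration, not the paper's implementation; the threshold value, function names, and the owner function are all assumptions. High-degree frontier vertices are packed into a bitmask that every rank would combine via a global (bitwise-OR) reduction, while low-degree vertices are routed point-to-point to their owning rank.

```python
# Sketch: degree-separated frontier exchange for distributed BFS.
# High-degree vertices: every rank needs them, so exchange a global
# bitmask via an all-reduce (bitwise OR). Low-degree vertices: send
# each one point-to-point to the rank that owns it.

DEGREE_THRESHOLD = 64  # assumed cutoff between "high" and "low" degree

def split_frontier(frontier, out_degree):
    """Partition the current frontier by out-degree."""
    high = {v for v in frontier if out_degree[v] >= DEGREE_THRESHOLD}
    return high, frontier - high

def exchange(frontier, out_degree, owner, num_ranks):
    """Prepare one round of frontier exchange (simulated on one machine)."""
    high, low = split_frontier(frontier, out_degree)
    # Global reduction payload: a bitmask of high-degree frontier vertices.
    high_mask = 0
    for v in high:
        high_mask |= 1 << v
    # Point-to-point payloads: per-destination buffers of low-degree vertices.
    outboxes = [[] for _ in range(num_ranks)]
    for v in low:
        outboxes[owner(v)].append(v)
    return high_mask, outboxes
```

The bitmask keeps the all-reduce payload proportional to the vertex count rather than the frontier size, which is what makes the global reduction affordable for the (few) high-degree vertices.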

    Multi-GPU Graph Analytics

    We present a single-node, multi-GPU programmable graph processing library that allows programmers to easily extend single-GPU graph algorithms to achieve scalable performance on large graphs with billions of edges. Our design directly reuses single-GPU implementations and only requires programmers to specify a few algorithm-dependent concerns, hiding most multi-GPU-related implementation details. We analyze the theoretical and practical limits to scalability in the context of varying graph primitives and datasets. We describe several optimizations, such as direction-optimizing traversal and a just-enough memory allocation scheme, for better performance and smaller memory consumption. Compared to previous work, we achieve best-of-class performance across operations and datasets, including excellent strong and weak scalability on most primitives as we increase the number of GPUs in the system.
    Comment: 12 pages. Final version submitted to IPDPS 201
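One multi-GPU implementation detail such a framework hides is how the graph gets distributed across devices. The sketch below shows a simple 1D vertex partitioning of an edge list; the function name and the contiguous-range assignment rule are illustrative assumptions, not the library's actual scheme.

```python
# Sketch: 1D vertex partitioning of an edge list across GPUs. Each
# vertex (and its outgoing edges) is assigned to exactly one GPU,
# using contiguous vertex ranges (an assumed, simple rule).

def partition_edges(edges, num_vertices, num_gpus):
    """Split an edge list into per-GPU subgraphs by source-vertex owner."""
    owner = lambda v: v * num_gpus // num_vertices  # contiguous ranges
    subgraphs = [[] for _ in range(num_gpus)]
    for src, dst in edges:
        subgraphs[owner(src)].append((src, dst))
    return subgraphs
```

With a partition like this in place, the framework's remaining algorithm-dependent questions reduce to what each GPU must send about its boundary vertices and when.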

    Gunrock: GPU Graph Analytics

    For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based algorithms and ranking algorithms, to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.
    Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC), an extended version of the PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU"
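The frontier-centric abstraction can be illustrated with two of its core operators: an "advance" step that expands a frontier to its neighbors, and a "filter" step that prunes it. The sequential Python below is only a sketch of the abstraction; Gunrock itself runs these operators as bulk-synchronous GPU kernels, and the function names here are chosen for illustration.

```python
# Sketch of a frontier-centric graph abstraction: BFS expressed as
# repeated advance (expand) + filter (prune) operations on a frontier.

def advance(graph, frontier):
    """Expand each frontier vertex to its out-neighbors."""
    return [dst for src in frontier for dst in graph.get(src, [])]

def filter_visited(candidates, visited):
    """Keep each unvisited vertex once and mark it visited."""
    out = []
    for v in candidates:
        if v not in visited:
            visited.add(v)
            out.append(v)
    return out

def bfs(graph, source):
    """BFS as a sequence of frontier operations."""
    visited = {source}
    frontier = [source]
    depth = {source: 0}
    level = 0
    while frontier:
        level += 1
        frontier = filter_visited(advance(graph, frontier), visited)
        for v in frontier:
            depth[v] = level
    return depth
```

The appeal of the abstraction is that many primitives (BFS, SSSP, PageRank, and others) reduce to different advance/filter/compute combinations, so the programmer writes the per-vertex logic while the framework supplies load balancing and kernel launches.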

    Performance Characterization of High-Level Programming Models for GPU Graph Analytics

    We identify several factors that are critical to high-performance GPU graph analytics: efficient building-block operators, synchronization and data movement, workload distribution and load balancing, and memory access patterns. We analyze the impact of these critical factors through three GPU graph analytic frameworks: Gunrock, MapGraph, and VertexAPI2. We also examine their effect on different workloads: four common graph primitives from multiple graph application domains, evaluated on real-world and synthetic graphs. We show that efficient building-block operators enable more powerful operations for fast information propagation and result in fewer device kernel invocations, less data movement, and fewer global synchronizations, and thus are key focus areas for efficient large-scale graph analytics on the GPU.

    Multi-GPU Graph Processing

    While modern GPU graph analytics libraries provide usable programming models and good single-node performance, the memory size and computation power of a single GPU are still too limited for analyzing large graphs. Scaling graph analytics is challenging, however, because of the characteristics of graph applications: irregular computation, low computation-to-communication ratios, and limited communication bandwidth on multi-GPU platforms. Addressing these challenges while still maintaining programmability is yet another difficulty.

    In this work, I target the scalability of graph analytics to multiple GPUs, beginning with multiple GPUs within a single node. Compared to GPU clusters, single-node multi-GPU platforms are easier to manage and program, but can still act as good development environments for multi-GPU graph processing. My work targets several aspects of multi-GPU graph analytics: the inputs that graph application programmers provide to the multi-GPU framework; how the graph should be distributed across GPUs; the interaction between local computation and remote communication; what and when to communicate; how to combine received and local data; and when the application should stop. I answer these questions by extending the Gunrock graph analytics framework from a single GPU to multiple GPUs, showing that most graph applications scale well in my system. I also show that direction-optimizing breadth-first search (DOBFS) is the most difficult scaling challenge because of its extremely low compute-to-communication ratio.

    To address the DOBFS scaling challenge, I demonstrate a DOBFS implementation with efficient graph representation, local computation, and remote communication, based on the idea of separating high- and low-degree vertices. I particularly target communication costs, using global reduction with bit masks for high-degree vertices and point-to-point communication for low-degree vertices. This greatly reduces overall communication cost and results in good DOBFS scaling with log-scale graphs on more than a hundred GPUs in the Sierra early access system (the testing bed for the Sierra supercomputer).

    Next, I revisit the design choices I made for the single-node multi-GPU framework in view of recent hardware and software developments, such as better peer GPU access and unified virtual memory. I analyze 9 newly developed complex graph applications for the DARPA HIVE program, implemented in the Gunrock framework, and show a wide range of potential scalabilities. More importantly, the questions of when and how to communicate are more diverse than those in the single-node framework. With this analysis, I conclude that future multi-GPU frameworks, whether single- or multi-node, need to be more flexible: instead of communicating only at iteration boundaries, they should support a more flexible, general communication model. I also propose other research directions for future heterogeneous graph processing, including asynchronous computation and communication, specialized graph representation, and heterogeneous processing.
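The direction-optimizing idea underlying DOBFS (due to Beamer et al.) switches between top-down "push" expansion from the frontier and bottom-up "pull" scanning of unvisited vertices once the frontier grows large. The sketch below illustrates that switch sequentially; the threshold heuristic and names are simplified assumptions, not the dissertation's implementation.

```python
# Sketch of direction-optimizing BFS: push (top-down) while the
# frontier is small, pull (bottom-up) once it is large. The switching
# threshold here is an assumed, simplified heuristic.

def dobfs(graph, reverse_graph, num_vertices, source):
    """BFS depths via direction-optimizing traversal."""
    depth = {source: 0}
    frontier = {source}
    level = 0
    while frontier:
        level += 1
        next_frontier = set()
        if len(frontier) < num_vertices // 4:  # assumed switch point
            # Top-down: push from each frontier vertex to its neighbors.
            for u in frontier:
                for v in graph.get(u, []):
                    if v not in depth:
                        depth[v] = level
                        next_frontier.add(v)
        else:
            # Bottom-up: each unvisited vertex pulls from its in-neighbors
            # and stops at the first one found in the frontier.
            for v in range(num_vertices):
                if v in depth:
                    continue
                if any(u in frontier for u in reverse_graph.get(v, [])):
                    depth[v] = level
                    next_frontier.add(v)
        frontier = next_frontier
    return depth
```

The bottom-up phase is what makes DOBFS communication-bound in a distributed setting: every rank needs to know the full frontier to answer "is any of my in-neighbors in it?", which is exactly what the bitmask global reduction over high-degree vertices makes affordable.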